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Abstract 

Nonnegative matrix factorization (NMF) was popularized as a tool 
for data mining by Lee and Seung in 1999. NMF attempts to approx- 
imate a matrix with nonnegative entries by a product of two low-rank 
matrices, also with nonnegative entries. We propose an algorithm 
called rank-one downdate (RID) for computing a NMF that is partly 
motivated by singular value decomposition. This algorithm computes 
the dominant singular values and vectors of adaptively determined 
submatrices of a matrix. On each iteration, RID extracts a rank-one 
submatrix from the dataset according to an objective function. We 
establish a theoretical result that maximizing this objective function 
corresponds to correctly classifying articles in a nearly separable cor- 
pus. We also provide computational experiments showing the success 
of this method in identifying features in realistic datasets. 
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1 Nonnegative Matrix Factorization 



Several problems in information retrieval can be posed as low-rank matrix 
approximation. The seminal paper by Deerwester et al. [5] on latent semantic 
indexing (LSI) showed that approximating a term-document matrix describ- 
ing a corpus of articles via the SVD led to powerful query and classification 
techniques. A drawback of LSI is that the low-rank factors in general will 
have both positive and negative entries, and there is no obvious statistical 
interpretation of the negative entries. This led Lee and Seung [13] among 
others to propose nonnegative matrix factorization, that is, approximation 
of a matrix A e R mx as a product of two factors WH T , where W € R mxfc , 
H e R™ xfc , both have nonnegative entries, and k < min(m, n). Lee and 
Seung showed intriguing results with a corpus of images. In a related work, 
Hofmann [11] showed the application of NMF to text retrieval. Nonnegative 
matrix factorization has its roots in work of Gregory [10], Paatero [14] and 
Cohen and Rothblum [4]. 

Since the problem is NP-hard [18], it is not surprising that no algorithm 
is known to solve NMF to optimality. Heuristic algorithms proposed for 
NMF have generally been based on incrementally improving the objective 
|| A — WH T \\ in some norm using local moves. A particularly sophisticated 
example of local search is due, e.g., to Kim and Park [12]. A drawback 
of local search is that it is sensitive to initialization and also is sometimes 
difficult to establish convergence. 

We propose an NMF method based on greedy rank-one downdating that 
we call RID. RID is partly motived by Jordan's algorithm for computing 
the SVD, which is described in Section 2. Unlike local search methods, 
greedy methods do not require an initial guess. In Section 3, we compare 
our algorithm to Jordan's SVD algorithm, which is the archetypal greedy 
downdating procedure. Previous work on greedy downdating algorithms for 
NMF is the subject of Section 4. In Section 5, we present the main theoretical 
result of this paper, which states that in a certain model of text due to 
Papadimitriou et al. [15], optimizing our objective function means correctly 
identifying a topic in a text corpus. Similarly, optimization of the objective 
function corresponds to identifying a feature in a certain model of an image 
database, as demonstrated in Section 6. We then turn to computational 
experiments: in Section 8, we present results for RID on image databases, 
and in Section 9, we present results on text. 
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2 Algorithm and Objective Function 



Rank-one downdate (RID) is based on the simple observation that the lead- 
ing singular vectors of a nonnegative matrix are nonnegative. This is a con- 
sequence of the Perron- Frobenius theorem [9] . Based on this observation, it 
is trivial to compute rank-one NMF. This idea can be extended to approxi- 
mate higher order NMF. Suppose we compute the rank-one NMF and then 
subtract it from the original matrix. The original matrix will no longer be 
nonnegative, but all negative entries can be forced to be zero or positive and 
the procedure can be repeated. 

An improvement on this idea takes only a submatrix of the original matrix 
and applies the Perron-Frobenius theorem. The point is that taking the 
whole matrix will in some sense average the features, whereas a submatrix 
can pick out particular features. A second point of taking a submatrix is 
that a correctly chosen submatrix may be very close to having rank one, so 
the step of forcing the residuals to being zero will not introduce significant 
inaccuracy (since they will already be close to zero). 

The outer loop of the RID algorithm is as follows. 

function [W, H] = R1D(A, k) 
Inputs: A G R mx ", k > 0. 
Outputs: W G R mxfc , H G R nxfc . 

(1) for pi — 1, . . . , k 

(2) [M, N, u, v, a] = ApproxRankOneSubmatrix (A) ; 

(3) W(M,fj,) = u(M). 

(4) H(N,») = <rv(N). 

(5) A(M,N)=0. 

(6) end for 

Here, M is a subset of {1, . . . , m}, iV is a subset of {1, ... , n}, u G R m , v G 
R n and a G R, and u, v are both unit vectors. We follow Matlab subscripting 
conventions, so that u(M) denotes the subvector of u indexed by M. In 
the above algorithm, u({l, . . . , m} — M) = and v({l, . . . , n} — N) = 0. 
The function ApproxRankOneSubmatrix selects M, N, u(M), v(iV), a so that 
A(M,N) (i.e., the submatrix of A indexed by row set M and column set 
N) is approximately rank one, and in particular, is approximately equal to 
ui.Uiov'S.Y!. 

This outer loop for NMF may be called "greedy rank-one downdating" 
since it greedily tries to fill the columns of W and H from left to right by 
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finding good rank-one submatrices of A and subtracting them from A. The 
classical greedy rank-one downdating algorithm is Jordan's algorithm for the 
SVD, described in Section 3. Related work on greedy rank-one downdating 
for NMF is the topic of Section 4. 

The subroutine ApproxRankOneSubmatrix, presented later in this section, 
is a heuristic routine to maximize the following objective function: 



f(M, N, u, a, v) = \\A(M, N)\\ 2 F - 7 ||A(M, N) - u(M)av(N) T \\ 2 F . (1) 



Here, 7 is a penalty parameter. The Frobenius norm of an m x n matrix B, 
denoted \\B\\ F , is defined to be y/B(l, l) 2 + B(l, 2) 2 + • • • + B(m, n) 2 . The 
rationale for (1) is as follows: the first term in (1) expresses the objective 
that A{M, N) should be large, while the second term penalizes departure of 
A(M, N) from being a rank-one matrix. 

Since the optimal u, a, v come from the SVD (once M, N are fixed), the 
above objective function can be rewritten just in terms of M and iV as 



where p = min(]M|, \N\). The penalty parameter 7 should be greater than 
1 so that the presence of low-rank contributions is penalized rather than 
rewarded. 

We conjecture that maximizing (1) is NP-hard (see Section 7), so we in- 
stead propose a heuristic routine for optimizing it. The procedure alternates 
improving (v, N) and (u, M). The rationale for this alternation is that for 
fixed (v,N), the objective function (1) is separable by rows of the matrix. 
Similarly, for fixed (u, M), the objective function is separable by columns. 
Let us state and prove this as a lemma. 

Lemma 1. Let (v, N) be the optimizing choice of these variables in (1). 
Then the optimal M is determined as follows. Define 



p p 



f(M, N) 



*i(A(M, iV)) 2 - 7 E ^( M > iV )) 2 




^(M, iV)) 2 -(7-1) 

■(a 2 (A(M,N)) 2 + ... + a p (A(M,N)) 2 ), 



(2) 



9i = - 



A(i, N)A(i, N) T + 7(A(i, A^)v(A")) 2 , 



(3) 



where 



7 = 7/(7 - !)■ 



(4) 
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Then i G M if gi > 0. (If exact equality Qi = holds, then including i or 
not does not affect optimality.) Furthermore, aui is optimally chosen to be 
A(i,N)v(N). 

Remark 1. The lemma gives the formula for optimal aui for each i, i.e., 
the formula for the optimal cxu(M). To obtain a formula for optimal u and 
a separately, we define a := \\o~u(M)\\ and u(M) := <7u(M)/||<7u(M)||. 
Remark 2. Assuming instead that the optimizing choice (u, M) is given, 
there is a similar formula for determining membership in N. Define 

fj = -A(M,j) t A(M,j) +7(A(M,j) T u(M)) 2 , (5) 

and take N = {j : f) > 0}. 
Proof. Observe that 

m 

f(M, N,u,a,v) = $>m« (P(*,iV)|| 2 -l\\A(i,N) - Av(iV) T || 2 ) , 

where fa = aui and Xm(0 = 1 for i G M and xm(«) = for i M. Observe 
that Pi occurs only in the ith term of the above summation, hence assuming 
v and iV are optimal, each term may be optimized separately. The optimal Pi 
(that is, the minimizer of \\A{i, N) - Piv(N) T \\) is A(i, N)v(N), the solution 
to a simple linear least-squares minimization. Thus, we conclude that putting 
row % into index set M is improves the objective function if and only if gi > 0, 
where 

9i = \\A(i, N) || 2 - 7 \\A(i, N) - A(i, N)v(N)v(N) T f. 
The formula for ^ can be simplified as follows: 

9i = A(i, N)A(i, N) T 

- 7 (A(i, N) - A(i, iV)v(A^)v(A^) T )(A(i, N) - A(i, Ar)v(A^)v(iV) T ) T 
= -( 7 - l)A(i, N)A(i, Nf + >y(A(i, N)v(N)) 2 . 

Rescaling by 7 — 1 (which does not affect the acceptance criterion) and sub- 
stituting (4), we that row % makes a positive contribution to the objective 
function provided i(A(i, N)v(N)) 2 - A(i, N)A(i, N) T > 0. □ 

The next issue is choice of starting guess for M, N, u, v, a. The algorithm 
should be initialized with a starting guess that has a positive score, else the 
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rules for discarding rows and columns could conceivable discard all rows or 
columns. More strongly, in order to improve the score of converged solution, 
it seems sensible to select a starting guess with a high score. For this reason, 
RID uses as its starting guess a single column of A, and in particular, the 
column of A with the greatest norm. (A single row may also be chosen.) 
It then chooses u to be the normalization of this column. This column is 
exactly rank one, so for the correct values of o and v the first penalty term 
of (1) is zero. We have derived the following algorithm for the subroutine 
ApproxRankOneSubmatrix occurring in statement (2) in RID. 

function [M, N, u, v, a] = ApproxRankOneSubmatrix (A); 
Input: A e R mxn . 

Outputs: M C {1, . . . , m}, N C {1, . . . , n}, u e R m , v6R>GR. 
Parameter: 7 > 1 

(1) Select jo G {1, ... to maximize \\A(:, j )\\. 

(2) M = {!,... ,m}- 

(3) N = { Jo }. 

(4) a=\\A(:,j )\\. 

(5) u = A(:, 3o )/a. 

(6) Repeat 

(7) v = A(M,:) T u(M). 

(8) N = {j:^v(j) 2 -\\A(M,j)\\ 2 >0}. 

(9) v(7V) = v(iV)/||v(iV)||. /* Other entries of v unused */ 

(10) u = A(:,N)v(N). 

(11) M = {t:^u(i) 2 -\\A(i,N)\\ 2 >0}. 

(12) a=\\u(M)\\. 

(13) u(M) = u(M)/a. /* Other entries of u unused */ 

(14) until stagnation in M, N,u,v, a. 

The 'Repeat' loop is guaranteed to make progress because each iteration 
increases the value of the objective function. On the other hand, there does 
not seem to be any easy way to derive a useful prior upper bound on its 
number of iterations. In practice, it proceeds quite quickly, usually converg- 
ing in 10-15 iterations. But to guarantee fast termination, monotonicity can 
be forced on M and iV by requiring M to shrink and iV to grow. In other 
words, statement (8) can be replaced by 

N = NU{j:^v(j) 2 -\\A(M,j)\\ 2 >0}, 
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and statement (11) by 



M = M - {i : ^u{if - \\A(i,N)\\ 2 <0}. 

Our experiments indicate that this change does not have a major impact on 
the performance of RID. 

Another possible enhancement to the algorithm is as follows: we modify 
the objective function by adding a second penalty term 

-p\M\ ■ \N\ (6) 

to (1) where p > is a parameter. The purpose of this term is to penalize 
very low-norm rows or columns from being inserted into A(M, N) since they 
are probably noisy. For data with larger norm, the first term of (1) should 
dominate this penalty. Notice that this penalty term is also separable so it 
is easy to implement: the formula in (8) is changed to ^yv(j) 2 — \\A(M ) j)\\ 2 — 
p\M\ > while the formula in (11) becomes ^u{if - \\ A(i, N) \\ 2 - p\N\ > 0, 
where p = p/(7 — 1). We may select p so that the third term is a small 
fraction (say fj = 1/20) of the other terms in the initial starting point. This 
leads to the following definition for p: 

P = v(l~ l)v 2 /m, 
which may be computed immediately after (4). 

3 Relationship to the SVD 

The classical rank-one greedy downdating algorithm is Jordan's algorithm 
for computing the singular value decomposition (SVD) [17]. Recall that the 
SVD takes as input anmxn matrix A and returns three factors U,H,V such 
that U G R mxfe and U has orthonormal columns (i.e., U T U — I), E e R fcxfc 
and is diagonal with nonnegative diagonal entries, and V G R nxfe also with 
orthonormal columns, such that UYA/ T is the optimal rank-/c approximation 
to A in either the 2-norm or Frobenius norm. (Recall that the 2-norm of an 
m x n matrix B, denoted ||-B||2, is defined to be a/ X max (B T B) , where A max 
denotes the maximum eigenvalue.) 

[U, E, V] = JordanSVD(A, fc); 

Input: A G R mxn and k < min(m, n). 



7 



Outputs: U, E, V as above. 

( 1 ) for /j, — 1 , . . . , k 

(2) Select a random nonzero u e R m . 

(3) <t=||u||- 

(4) u = u/a. 

(5) Repeat /* power method */ 

(6) v = A T u. 

(7) v = v/||v||. 

(8) u = Av. 

(9) cr = ||ti||. 

(10) u = u/a. 

(11) until stagnation in u, a, v. 

(12) A = A-uav T ; 

(13) U(:,fi) = u; 

(14) f(:,/i) = v; 

(15) E( A i, A i)=<r; 

(16) end for 

Thus, we see that RID is quite similar to the SVD. The principal differ- 
ence is that RID tries to find a sub matrix indexed by M x N at the same 
time that it tries to identify the optimal u and v. Because of this sim- 
ilarity, the formulas for u and v occurring in (9) and (13) of subroutine 
ApproxRankOneSubmatrix, which were presented earlier as solutions to a 
least-squares problem, may also be regarded as steps in a power method. In 
fact, if M and iV are fixed, then the inner Repeat-loop of that subroutine 
will indeed converge to the dominant singular triple of A(M,N). 

As noted earlier, use of the SVD on term-document matrices dates back 
to latent semantic indexing due to Deerwester et al. [5]. Its effectiveness at 
creating a faithful low-dimensional model of a corpus in the case of separable 
corpora was established by Papadimitriou et al. [15]. Although not originally 
proposed specifically as a clustering tool, the SVD has been observed to find 
good clusters in some settings [6]. 

The SVD, however, has a significant shortcoming as far as its use for 
clustering. Consider the following term-document matrix A, which is a sum 
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of a completely separable matrix B and noise matrix E: 
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It should be clear that there are two separate topics in A given by the two 
diagonal blocks, and a reasonable NMF algorithm ought to be able to identify 
the two blocks. In other words, for k = 2, one would expect an answer close 
to 



W = H 



Perhaps unexpectedly, the dominant right singular vector of A is very close 
to being proportional to [1; 1; 1; 1], i.e., the two topics are entangled in one 
singular vector. The reason for this behavior is that the matrix B has two 
nearly equal singular values, so its singular vectors are highly sensitive to 
small perturbations (such as the matrix E). RID avoids this pitfall by com- 
puting the dominant singular vector of a submatrix of the original A instead 
of the whole matrix. 
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4 Related Work 

As mentioned in the introduction, most algorithms proposed in the literature 
are based on forming an initial W and H and then improving them by local 
search on an objective function. The objective function usually includes a 
term of the form || A — WH T \\ in some norm, and may include other terms. 

A few previous works follow an approach similar to ours, namely, greedy 
subtraction of rank-one matrices. This includes the work of Bergmann et 
al. [2], who identify the rank-one matrix to subtract as the fixed point of an 
iterative process. Asgarian and Greiner [1] find the dominant singular pair 
and then truncate it. Gillis [8] finds a rank-one underestimator and subtracts 
that. Boutsidis and Gallopoulos [3] consider the use of a greedy algorithm for 
initializing other algorithm and make the following interesting observation: 
The nonnegative part of a rank-one matrix has rank at most 2. 
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The main innovation herein is the idea that the search for the rank-one 
submatrix should itself be an optimization subproblem. This observation al- 
lows us to compare and rank one candidate submatrix to another. (Gillis also 
phrases his subproblem as optimization, although his optimization problem 
does not explicitly seek submatrices like ours.) A second innovation is our 
analysis showing in Section 5 that if the subproblemn were solved optimally, 
then RID would be able to accurately find the topics in the Papadimitriou 
et al. [15] model of e-separable corpora. 

5 Behavior of this objective function on a 
nearly separable corpus 

In this section, we establish the main theoretical result of the paper, namely, 
that the objective function given by (1) is able to correctly identify a topic 
in a nearly separable corpus. We define our text model as follows. There is a 
universe of terms numbered 1, . . . , m. There is also a set of topics numbered 
1, . . . , t. Topic k, for k = 1, . . . , t, is a probability distribution over the terms. 
Let P(i, k) denote the probability of term i occurring in topic k. Thus, P is 
a singly stochastic matrix, i.e., it has nonnegative entries with column sums 
exactly 1. We assume also that there is a probability distribution over topics; 
say the probability of topic k is r^, for k = 1, . . . , t. The text model is thus 
specified by P and r±, . . . , r t . We use the Zipf distribution as the model of 
document length. In particular, there is a number L such that all documents 
have length less than L, and the probability that a document of length I 
occurs is 

1 + 1/2 + ••• + !/(£- 1)' 
We have checked that the Zipf model is a good fit for several common 
datasets. [SHOW SOME DATA.] 

A document is generated from this text model as follows. First, topic 
k is chosen at random according to the probability distribution {tl, . . . , r t }. 
Then, a length I is chosen at random from {1, . . . , L— 1} according to the Zipf 
distribution. Finally, the document itself is chosen at random by selecting 
/ terms independently according to the probability distribution P(:,k). A 
corpus is a set of n documents chosen independently using this text model. 
Its term-document matrix is the m x n matrix A such that A(i,j) is the 
frequency of term i in document j. 
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We further assume that the text model is e-separable, meaning that each 
topic k is associated with a set of terms T k C {1, . . . , m}, that 7\, . . . , T t are 
mutually disjoint, and that P(i,k) < e for % T k , i.e., the probability that 
a document on topic k will use a term outside of T k is small. Parameter e 
must satisfy some inequalities described below. This corpus model is quite 
similar to the model of Papadimitriou et al. [15]. One difference is in the 
the document length model. Our model also relaxes several assumptions of 
Papadimitriou et al. 

Our main theorem is that the objective function in the previous section 
can correctly find documents associated with a particular topic in a corpus. 

Theorem 2. Let (P, (ti, . . . , Tt)) specify a text model, and let a > be 
chosen arbitrarily. Suppose there exists an e > satisfying (15) below such 
that the text-model is e-separable with respect to Ti, . . . ,T t , the subsets of 
terms defining the topics. Let A be the term-document matrix of a corpus of 
n documents drawn from this model when the document-length parameter is 
L. Choose 7 = 4 in (1). Then with probability tending to 1 as n — > oo and 
L — > oo (refer to Assumption Al below), the optimizing pair (M,N) of (1) 
satisfies the following. Let Di, . . . , D t be the partitioning of the columns of 
A according to topics. There exists a topic k G {1, . . . , t} such that A(M, N) 
and A(T k , D k ) are nearly coincident in the following sense. 

A(i,jY<a A dJ) 2 - 

(i,j)e(MxN)A(T k xD k ) (i,j)eMxN 

Here, X AY denotes the set-theoretic symmetric difference (X — Y) U 
(Y-X). 

The organization of the proof of this theorem is as follows. We first 
analyze the Zipf distribution and propose some assumptions that hold with 
probability tending to 1 as n, L — > oo. Under these assumptions, we make 
some preliminary estimates of norms of submatrices of A. Then we prove a 
sequence of lemmas as follows. 

• In Lemma 3, we establish a lower bound on the optimal value of the 
objective function by analyzing the objective function value with the 
choice M = T k) N = D' k , where D' k C D k are the 'acceptable' docu- 
ments from D k (defined below). 
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In Lemma 4, we establish an upper bound on the contribution from 
unacceptable entries to the optimal solution. 



• In Lemma 5, we deduce as a consequence of the two preceding lemmas 
that heavy acceptable entries must compose a significant portion of the 
optimal solution. Here, an entry A(i,j) is heavy if P(i,k) > x-, where 
Xis a scalar defined below and k is the topic of document j. 

• In Lemma 7, we show that the optimal solution cannot contain heavy 
acceptable entries from two different topics. 

• Thus, the preceding lemmas imply that heavy acceptable entries from a 
single topic k must dominate the optimal solution. Therefore, we show 
in Lemma 8 that the left and right singular vectors of the optimal 
A(M, N) can be estimated from P(M, k) and the vector of lengths of 
documents indexed by N respectively. 

• In Lemma 9, we give a general condition under which adding a row or 
column to M or iV could improve the objective function value. 

• In Lemma 10, we show that any column in D' k satisfies the condition 
given by Lemma 9 (because of the estimate of the left singular vector 
given by Lemma 8), and therefore D' k C N if N is optimal. 

• In Lemma 11, we establish using analogous reasoning that the heavy 
terms H k of topic k must be a subset of M if M is optimal. 

• Finally, the theorem can be proved because all entries of A(M, N) that 
are not from Hj, x D' k are either not heavy or unacceptable but in either 
case, must have small norm. 

We start by stating the inequality that e must satisfy in order for the 
theorem to hold. It should be noted that the constants that follow are quite 
large but are likely large overestimates. Let P mm = min{P(i, k) : i e T k , k = 
1, . . . , t}. Without loss of generality, P m i n > since any row % e T k such that 
P(i, k) = may be removed from T k without affecting the validity of the 
model. The proof requires four parameters, e, 9, <fi and x- The parameters 
depend on a, m, t, and P m i n . They do not depend on n and L (since the 
theorem requires n — > oo and L — > oo). 

First, we define 

X = min(xi,X 2 ), 
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where 



Xi 

X2 



Next, we choose 
where 



0i 

03 



Third, take 
where 



01 

02 
03 



^3/(32 • 16 • 25 2 • 256t)/m, 
^3^/(32 • 512t)/ 

e = mm(e u e 2 ,e 3 ), 
- p ■ ii 

1 mm/ 

-- y/3/(8- 18 • 278 -t)/m, 
= x- V 1 /^ • 278 -mt). 

= min(0i, 02, 3 ), 

3/(16-256 -3-2-25 2 mt), 

3x 2 /(64-6-278-mt), 

a/(32-512mf). 



(7) 
(8) 



(9) 
(10) 
(11) 



Last, 



mm 



in(v/3/(lOm), X ,0). 



(12) 
(13) 
(14) 

(15) 



We start with our assumptions of the form that n or L must be sufficiently 
large. The inequalities in this assumption are needed below. Let nk = \Dk\, 
that is, the number of documents on topic k, k = 1, . . . , t. Let r min = 
min(ri, . . . ,r t ). 

Assumption Al. Let q = log 2 L. Assume n and L are sufficiently large 
so that all of the following are valid. 



mexp( 



< 



L > 3g/(160), 
n > q 3 , 
nr min > 20q. 
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The first two inequalities are lower bounds on L, and the last two say that 
L cannot grow much faster than n. Also, assume L is a power of 4 so that 
q is an even integer. (This last assumptions is not necessary but simplifies 
notation.) 

The next four assumptions are also needed for our analysis and are valid 
with probability tending to 1 as n, L — > oo provided Assumption Al holds. 
The mean value for n k is nr k , so let us impose the following assumption. 

Assumption A2. For each k, rikT k /2 < n k < 2n k r k . 

By the Chernoff-Hoeffding bound and the union bound, this assumption 
will fail with probability at most t exp(— 2nr min ). This quantity tends to 
as n — > oo. 

Next, let us provide some estimates for the Zipf distribution. Let us 
partition D k into subsets C k ,i, ■ ■ ■ , C k , q where C k , L contains documents of 
D k of length [2 b ~ 1 ,2 i ) and q = log 2 L, an even integer by assumption. It 
follows from underestimating the Zipf distribution using an integral that the 
probability that a document lies in C k , h is at least (n k /q — 2)/n k = 1/q — 2/n k . 
We have assumed in A2 that n k > nr min /2 and in Al that nr min /2 > lOq. 
The choice of lengths are independent trials, and the mean size of C k>L is at 
least n k /q — 2, All of these bounds lead to the following. 

Assumption A3. For each k = 1, . . . ,t and u = \,...,q, \C k , L \ > n k /(2q). 

The probability of failure of this assumption is at most gtexp(— n k /(8q)) 
which again tends to zero since n k / (8q) > nr min /( 16q) by A2 and nr min /( 16q) > 
q 2 r min /16 by Al, and finally, q — > oo. A consequence of A3 is that the number 
of documents that have length at least L 1 / 2 (i.e., those in Cfc )9 /2+iU • • -UC ktQ ) 
is at least n k /A. We also need an upper bound on \C ktL \. The mean value of 
this quantity is at most n k /q + 1 using an integral to overestimate the Zipf 
distribution. Since the documents are chosen independently, we obtain the 
following. 

Assumption A4. For each k — 1, . . .,t and i — 1, . . . ,q, \C k ,i\ < 2n k /q. 

The probability of failure is qtexp(—2n k /q 2 ) by the Chernoff-Hoeffding 
bound. Using arguments similar to those in the previous paragraph, this 
tends to as n, L — > oo under Al. 

Let j G D k index a document on topic k whose length we denote as lj. 
The mean value for A(:,j) is ljP(:,k) by the properties of the multinomial 
distribution. Let us now consider the probability that any A(i,j) diverges 
significantly from the mean, e.g., say \ A(i,j) — ljP(i, k)\ > lj6. Again, by the 
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Chernoff-Hoeffding bound, this probability is at most exp(— 21 j9 2 ), so using a 
union bound, the probability that any entry will diverge by lj9 from its mean 
is at most mexp(— 2lj9 2 ). If we further assume lj > L 1 / 2 , this quantity is at 
most mexp(— 2L 1 ^ 2 9 2 ). We have assumed in Al that L is sufficiently large 
so that mexp(— 2L 1 / 2 9 2 ) < 0, where is the parameter given by (12)-(14). 

We say that a column j e D k is acceptable if its length lj is at least 
L 1 / 2 and if the distance of each entry from its mean is at most 91 j. Let 
_D acc denote the subset of {1, . . . ,n} of acceptable documents and D nnacc its 
complement. By the assumptions so far, the number of documents with 
length at least L 1 / 2 in topic k is at least n k /A. Let D' k denote D k n -D aC c, i-e., 
the acceptable subset of D k and let C' k q / 2+ v • • • ■> C k ,q denote the acceptable 
subsets of C k>q /2+\, • • • , Cfc 5 g. We now impose the last assumption. 

Assumption A5. The acceptable subset of each C kjL (l = q/2 + 1, . . . , q) 
has size at least |Cfe it |(l — 20). 

By the union bound, this assumption fails with probability at most 

E exp(-2|^|0), 

fc=l t=g/2+l 

which, according to prior assumptions, is at most (qt/2) exp(— n0r min /(2g)). 
Again, from Al and A2, this probability tends to for large n and L. 

Let us now derive some inequalities useful for the upcoming analysis. A 
simple inequality is 

\\A(-J)\\<h (16) 

which follows because of the inequality ||x|| 2 < ||x||i. Another simple in- 
equality is that if a, b are both nonnegative, then 

(a - bf < a 2 + b 2 . (17) 

Let 1 G R n denote the vector (Z 1; . . . , /„) of document lengths. We now 
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establish some needed norm estimates for 1. 



iiirar = E z 



2 
3 

= EE', 2 

> E E *" 2 

, =q/2+lje c' kL 

t=g/2+l 

> E 2 2 - 2 (l-20)|C fc ,J 

i=g/2+l 

> E 2 2t ~ 2 (l-20K/(2g) 

i=g/2+l 

> (l-20KL 2 /(8g). (18) 

Here, Assumption A5 was used for the fifth line and A3 for second. A useful 
upper bound is: 

iiipoir = E /2 
= EE' 2 
< EE 22 ' 

= E2 2 lC fe ,| 
i=i 

<3 



< E 22t -2n fe /g 



t=i 



< 8n k L 2 /{3q). (19) 
16 



Since ||1|| 2 =||1( J D 1 )|| 2 + ---+||1(A)|| 2 , 

|| 1 1| 2 < 8nL 2 /(3g). (20) 

Some final estimates concern the sum of squares of lengths of unacceptable 
documents. A document can be unacceptable either because its length is less 
than L 1 / 2 (i.e., it lies in C kjL for some k — 1, . . . , t and some i — 1, . . . , q/2) or 
else because its term frequencies deviate too much from the mean (by more 
than Ijd in some position). In the former case, all document lengths are 
bounded by L 1 / 2 , hence squared document lengths are bounded by L. For 
the latter case, we can apply Assumption A5. Thus, we have the following 
estimate on unacceptable documents: 



|2 



\\l(D k - D> k )\\> = ||1(C M U---UC M/2 )|| 2 

+ ||l(Cfc, ? /2+l U • • • U C k , q ~ C' k q/2+l C'k,q)\\ 

< n k L+ \C k ,,-C' k j2^ 

L=q/2+l 

q 

L=q/2+l 

< n k L + 16(pn k L 2 / (3q). 

Here, Assumption A3 was used for the fourth line. We can combine these 
contributions from individual topics to obtain the upper bound: 

||lpu„acc)|| 2 < nL + 160nL 2 /(3g). (21) 
These estimates can be extended to sum of squares of the entries of A: 

m 

\\A(:,D k -D' k )\\l = 5>(^') 2 

jeD k -D' k i=i 

< ^ ' 2 



3 

jeD k -D' k 
< n k L + 16(pn k L 2 / (3g). 

The second line follows from (16). 
Thus, 

\\A(:,D unacc )\\ 2 F <nL + 160nL 2 /(3g). 
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Recalling from Assumption Al that L > 3g/(160), the second term domi- 
nates in the above four inequalities, so 

||l(Ac-^)f < 320n fe L 2 /(3g), 
\\A(:,D k -D' k )\\ 2 F < 32<Pn k L 2 /(3q), 

||lpunacc)|| 2 < 320nL 2 /(3g), (22) 

\\A(:,D unacc )\\ 2 F < 320nL 2 /(3g). (23) 

Because of (12), 

||l(D fc -^)|| 2 < n fc L 2 /(16-256-25 2 -gtm), (24) 

\\A(:,D k - D' k )\\ 2 F < n k L 2 / (16- 256 • 25 2 ■ qtm), (25) 

||l(Amacc)|| 2 < nL 2 /(16-256-25 2 -gtm), (26) 

\\A(:,D unacc )\\ 2 F < nL 2 /(16-256-25 2 -gtm). (27) 

With these preliminary inequalities in hand, we may now begin the first 
lemma in the proof of the main theorem. 

Lemma 3. Under Assumptions A1-A5, 

f opt > nL 2 /(25Qqtm), (28) 
where f opt denotes the optimal value of (1). 

Proof. The proof follows from estimating the value of the objective function 
for the choices M = T k and N = D k . We can estimate the first term in (1) 

as 

\\A(M,N)\\ 2 F = £$>(*, jf 

jeD' i e T k 



> 



> 



i&T k 

> (l-2 ( p)n k L 2 \\P(T k ,k)\\ 2 /(32q) 

> (1 - 2<p)n k L 2 1 \<6Aqm) 

> n k L 2 /(128qm). (29) 
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The second line follows by the definition of 'acceptable.' The third follow 
because 9 < P(i,k)/2 by (9). The fifth line relies on (18), the next on the 
fact that \\P(T k , k)\\ 2 > l/(2m) because ||P(:,£;)|| 2 > 1/m (which follows 
from ||P(:,fc)||i = 1) and ||P({1, . . . , m} - T k , k)\\ 2 < me 2 < 3/(10m) from 
(15). The last line uses < 1/4 because of (12). 

Now we turn to the second term in (1). Choose u, v,er so that u<xv T = 
P(T k , k)\(N) T , a rank-one matrix, where as above N = D' k . Since \A(i,j) — 
ljP(i,k)\ < lj9 when j is acceptable, we have the following bound for the 
second term. 

1 \\A(M,N)-ua^ T \\ 2 F = 7 ££(A(i,. 7 -)-Z i P(i,*)) 2 

jeD' k i€T k 

= 7 ^||l(L> fe )|| 2 -|T fc | 

< 8-f9 2 mn k L 2 /{3q) 

< n k L 2 /{256qm). 

The fourth line follows from (19), and the last follows from (10) (taking 
7 = 4). Thus, subtracting the above right-hand side from (29) shows that 
f opt > n k L 2 / (256gm). This inequality is valid for all k — 1, . . . , t, so we may 
assume it is true for the k that maximizes n k . This value of n k is therefore 
at least n/t. This establishes (28). □ 

For a particular topic k, say that a term index i e T k is heavy if P(i, k) > 
X, where x was defined by (7)-(8) above. Let H k C T k be the heavy indices. 
We use the notation top(j) to denote the topic of document j, j G {1, . . . , n}. 
Say that an entry A(i,j) of A(M,N) is a heavy entry if % e H k where k = 
top(j). Finally, say that an entry A(i,j) of A(M, N) is acceptable and heavy 
if it is heavy and j is acceptable. 

Lemma 4. Under Assumptions A1-A5, the sum of squares of entries of A 
that are not heavy but are acceptable is at most nL 2 /(16 • 256 • 25 2 qtm). 
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Proof. This is a straightforward estimate: 



t 

2 



E A ^ = EEE mu) 

A(i,j) not heavy & j acceptable fc=l j£D' k i^H^. 



t 



fe=i jez^ i^H fe 

< EEE'f(x-+") 2 

k=l j£D' k i$H k 



t 

fc=i jeDj; 

\ 2 1 1 1 1 1 2 



= (x + ^) 2 ]T(m-|tf fc |)0 



< m(x + ^) 2 ||l| 

< 8m( X + 6) 2 nL 2 /(3q) 

< 32m X 2 nL 2 /(3q) (30) 

< nL 2 /(16-256-25 2 gtm). (31) 

The second line follows by definition of 'acceptable.' The third follows be- 
cause P(i, k) < x if i is not heavy in topic k. The sixth line follows from 
(20), the seventh because 9 < x (refer to (11)) and the last from (7). □ 

Lemma 5. Under Assumptions A1-A5, The sum of squares of acceptable 
and heavy entries in A(M,N), where M,N are the optimizers of (1), is at 
least nL 2 /(512qt). 

Proof. The sum of squares of entries in A(M, N) from unacceptable docu- 
ments is bounded above by the sum of squares of entries in A of unacceptable 
documents, for which we have the estimate given by (27). The sum of squares 
of entries of A(M, N) which are acceptable but not heavy is bounded above 
by the same quantity for all of A, which is given by (31). Adding these 
two upper bounds gives a quantity less than half of the lower bound in (28), 
which proves the result. □ 

The following lemma is stated more generally than the others of this 
section (i.e., without Assumptions A1-A5 and without assuming 7 = 4) 
because it is more broadly applicable. 
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Lemma 6. Let A be an m x n matrix with nonnegative entries. Let M C 
{1, . . . , m} be the optimizing choice of M for (1). Assume 7 > 2. Let j,j' 
index two columns of A such that 



Then at least one of j or f is not a member of the optimizing choice of N. 

Remark. The lemma is also true when the roles of M and N are reversed 
since the value of the objective function (2) is unchanged under matrix trans- 
position. 

Proof. Let unit vector u be the optimizing choice for (1). As noted in 
Lemma 1, j and j' are included in the optimal N provided fj, fy > 0, where 



Here, we have simplified notation by allowing u to stand for u(M). We will 
now show that for any possible choice of u, either fj < or fy < 0, meaning 
that at least one of j or j' cannot be in N. 

To proceed, let us define normalizations r = A(M, j)/\\A(M, j)\\ and s = 
A(M,j')/\\A(M,j')\\. With these definition, (32) is rewritten r T s < 1 - 2/7. 
Since multiplying by a positive scalar does not affect the signs of fj or fj>, it 
suffices to redefine them using the normalized vectors: 



A(M,j) T A(M,f) 



< 1 - 2/7. 



(32) 



\\A(M,j)\\.\\A(M,j>)\\ 



fj = 7(^(M,j) T u)) 2 -(7-l)P(M,j)ir, 
^ = 7 (A(M, J "fu)) 2 -( 7 -l)P(M, J ")|| 2 . 



fj = 7 (r T u) 2 -7 + l 



and 



^= 7 (s T u) 2 -7 + l. 



Thus, 



fi + fi' = 7(r T u) 2 + 7 (s T u) 2 - 2 7 + 2 



(33) 




(34) 
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In this inequality we used the notation A max to denote the maximum eigen- 
value of a symmetric matrix. We also used the identity that for any matrix 
B, \\B\\ 2 = (\ m a,x(BB T )) 1 / 2 . The eigenvalues of the 2x2 matrix above can 
be easily determined as 1 ± r T s. Thus, 

fj + ff < (1 + r T s) 7 - 2 7 + 2 = (r T s - 1) 7 + 2. 

Since r T s < 1 — 2/7, the right-hand side is negative, thus showing that either 
fj or fji is negative. □ 

We can now apply the previous lemma to the text corpus under analysis. 

Lemma 7. Assume A1-A5 hold. Suppose that (i',f) are two acceptable 
heavy entries in the optimal solution (M,N). Then and (i',f) must be 
from the same topic k. 

Proof. Suppose that is an acceptable heavy entry on topic k, and 
is an acceptable heavy entry topic k! such that k' 7^ k. Suppose also that 
both G M. We will prove that either j or / is not in N. Let r = 
A(MJ)/\\A(MJ)\\ and s = A(MJ')/\\A(M,j')\\. Let us split r and s into 
three subvectors: r^Si contain those entries indexed M fl T k ; r 2 ,s 2 contain 
entries indexed by M n T k >; and r 3 , s 3 contain the remaining entries of M. 
Since j G T k , is heavy, and j is acceptable, this means that i e M HT k 
and A(i,j) > lj( X - 0), so that \\A(M n T k ,j) || 2 > l]( X - Of. Since 9 < 
x/2 (refer to (11)), this quantity is at least /|x 2 /4- On the other hand, 
||A(M — T k) j)\\ 2 < ml 2 (e + 9) 2 since column j is acceptable and P(i, k) < e 
for i T k . Thus, after rescaling, 

||[r 2 ;r 3 ]|| = ||r(M-T fc )|| 

= \\A(M-T k ,j)\\/\\A(M,j)\\ 

< \\A(M-T k ,j)\\/\\A(MnT k ,j)\\ 

< 2m l l 2 {e + 6)/ X . 

The inequality \ ^ 10m 1//2 (e + ff) follows from (11) and the fact that e < 9 
from (15). Thus, 1 1 [r 2 ; r 3 ] 1 1 < 1/5. Similarly, 1 1 [si ; S3] 1 1 < 1/5. Hence, 

|r T s| < |r^Si| + |r^s 2 | + |r|"s 3 | 

< ||si|| + ||r 2 || + ||r 3 || • ||s 3 || 

< 1/5 + 1/5 + 1/25 = 0.44. 

Thus, by Lemma 6, since 0.44 < 1 — 2/7 when 7 = 4, either j or j' is not 
present in the optimal N. □ 
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The next lemma shows that the left and right singular vectors of the 
optimal solution to (1) are determined largely by the document lengths and 
probability distribution for topic k. 

Lemma 8. Let (M,N) be the index sets that optimize (1). Let k be the 
index of the topic of the heavy entry occurring in (M,N) (which is uniquely 
determined according to Lemma 7). Let r and s be the right and left singular 
vectors of A(M,N), respectively. Assume A1-A5 hold. Then there exists a 
positive scalar K r such that 

H^ r - r 4 <i/ 6 , (35) 



l r o| 



where r is defined by 



Similarly, there exists a positive scalar k s such that 

ll^~ s °ll < 1/6, (37) 

Nil 

where s = P(M,k). In addition, r satisfies the following inequality: 

||r || 2 > nL 2 /(278qtm) (38) 



Proof. Let B be a matrix with the same dimensions as A(M,N), indexed 
the same way A is indexed (i.e., B and B(M,N) denote the same matrix) 
and whose entry is B(i,j) = P(i,k)ro(j). We will also use r and 

r (N) synonymously, and similarly for r, s and s . Observe that B is a rank- 
one matrix since B = P(M, k)r^. The only nonzero singular value of B is 
||r |H|P(M,AO||. 

Let us partition into three sets Ni U N 2 U N 3 given by 

JVi = Nf]D unacc ; N 2 = Nn(D acc -D' k ); N 3 = N n D' k . (39) 

From this partition, 

\\B-A(M,N)\\l < \\B-A(M,N)\\ 2 F 

= \\B(M, - A(M, NJWl + \\B(M, N 2 ) - A(M, N 2 )\\ 2 F 
+ \\B(M,N 3 )-A(M,N 3 )\\ 2 F 
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where we now obtain upper bounds on the three terms individually. 



||-B(M, TV!) — A(M, Ni)\\ 2 F < \\B(M : N 1 )\\ 2 F + \\A(M, N^l 

= n^M.^iHii^nDoir+EE^j) 2 

jeNi ieM 

< Hlpunacc)!! 2 +\\A(:,D mmcc )\\ 2 F 

< 2nL 2 /(256-3-2-25 2 gtm). 

In the above derivation, we used (17) for the first line, the relationship iVi C 
-Dunacc and ||P(:, k)\\ < 1 for the third line, and (26) and (27) for the last. 
Next, observe that B(M, N 2 ) = since N 2 H D k = 0, hence 

\\B(M,N 2 ) - A(M,N 2 )\\ 2 F = \\A(M,N 2 )\\ 2 F . 

All entries of the right-hand side are acceptable and not heavy; they are 
acceptable by choice of N 2 , and they are not heavy because Lemma 7 shows 
that there cannot be an acceptable heavy entry from a topic other than k in 
the optimal solution. Thus, from (31), 

\\B(M,N 2 ) — A(M, N 2 )\\ 2 F < nL 2 /(lQ-25Q-25 2 qtm). 

Finally, for j e N 3 , r (j) = lj since N 3 C D k . Thus, 

\\B(M,N 3 )-A(M,N 3 )\\ 2 F = Y,Y.^P(hk)-A{i,j)f 

jeN s ieM 

£ EE'; 2 * 2 

jeN 3 ieM 
= |M|^ 2 ||l(iV3)|| 2 

< m0 2 ||l(L> fc )|| 2 

< 8m9 2 n k L 2 /(3q) 

< nL 2 /(3 ■ 25 2 • 256qtm) 

Here, the definition of 'acceptable' was used for the second line, and (19) was 
used for the fifth line and (10) for the last line. 

Thus, we see that \\B(M,N) - A(M,N) \\ 2 < nL 2 /(25 2 • 256qtm). On 
the other hand, \\A(M,N)\\ 2 > nL 2 /(256qt) by (28). Thus \\B(M,N) - 
A(M,N)\\/\\A(M,N)\\ < 1/25. This means by the triangle inequality that 
\\B(M,N)-A(M,N)\\/\\B(M,N)\\ < 1/24. 
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Now we can apply Theorem 8.6.5 of Golub and Van Loan [9] on the 
perturbation of singular vectors to conclude that the normalized left singular 
vector of A(M, N) differs from r*o / 1 1 1*0 1 1 by at most 1/6. (Note that in applying 
the theorem, we use the fact that the second singular value of B(M,N) is 
zero since B(M,N) has rank one.) Similarly, the normalized right singular 
vector of A{M,N) differs from s /||s || by at most 1/6. 

Finally, to establish (38), we observe that 

||r || 2 = ||5|| 2 /||P(M,A;)|| 2 

> (24/25) 2 nL 2 /(256gtm). 

which implies (38). The first line follows because B is rank-one, and the 
second because ||S|| > (24/25)||A(M, N)\\ as established above, and ||P(: 
,k)\\ < 1 since ||P(:, Jfe)||i = 1. □ 

The following lemma is used to determine when adding a row or column 
to the sets M or N will increase the objective function (1). 

Lemma 9. Let u be a nonzero vector and d 1; d 2 perturbations such that 
||di|| < ||u||/6 and ||d 2 || < ||u||/6. Suppose a = «i(u + di) and b = «2(u + 
d 2 ) ; where k±,K2 are positive scalars. Then 

a T a- 7 ||a-/?b|| 2 > 

for 7 = 4 and for at least one choice of f3. 

Proof. Let us take (5 = ki/k 2 . Then 

a T a-7||a-/5b|| 2 = /t 2 (u + d : ) T (u + di) - 7||«i(u + di) - «i(u + d 2 ) || 2 
= k\ [u t u + 2u T di + (1 - 7)dfdi - 2 7 df d 2 - 7 dnf d 2 ] 
= k\ [u t u + 2u T di - 3d[dx - 8d[d 2 - 4d^d 2 ] 

> k\ [||u|| 2 -2||u|| • ||di|| -3||di|| 2 -8||di|| • ||d 2 || -4||d 

> k 2 ||u|| 2 (1 - 2/6 - 3/36 - 8/36 -4/36) 

> 0. 

□ 

The point of Lemma 9 is as follows. Suppose (M, N) is a putative solution 
for maximizing (1) and j ^ N. Suppose the right singular vector of A(M, N) 
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is s, and suppose that A(M,j) and s are both within relative distance of 
1/6 from another vector P(M,k) after rescaling, where k is the topic of 
column j. Recall that once M and u are fixed, the objective function of (1) 
becomes separable by columns, i.e., it is possible to choose the jth entry of v 
considering only the contribution of column j to the total objective function. 
The previous lemma says that there is a way to choose Vj so that (M, NU{j}) 
has a higher objective function value than (M,N), where we take the same 
M, u, a and extend v with the particular choice of Vj. This means that in 
fact N is not optimal, because it should also include j. The lemma can also 
be used on rows using the analogous argument. 

Now let us apply this lemma to deduce the contents of the optimal M 
and N. 

Lemma 10. Assume A1-A5 hold. In the optimal solution, D' k C N . 

Proof. Let us consider a column j G D' k , that is, an acceptable column for 
topic k. Observe that A(M,j) = lj(P(M, k)+d 2 ), where each entry of d 2 has 
absolute value at most 9 by definition of 'acceptable.' Now we observe that 
||d 2 || < 9m 1 ' 2 < x/6 < \\P(M, k)\\/6; the first follows because ||d 2 ||oo < 0, 
the second follows from (11), and the third follows because M contains at 
least one heavy row of k. 

Thus, A(M,j) = lj(P(M,h) + d 2 ) with d 2 < ||P(M,ife)||/6 and the left 
singular vector s of A(M,j) satisfies s = (P(M ) k) + di)//c s , with ||di|| 2 < 
||P(M,fc)||/6by(37). 

Thus, by Lemma 9, column j G D' k increases the value of the objective 
function since A(M,j) and the left singular value of A(M, N) are both scalar 
multiplies of perturbations of P(M,k), where the relative perturbation size 
is at most 1/6. This proves that all columns of D' k will lie in N. □ 

Recall that H k denotes the subset of T k of heavy rows (terms) associated 
with topic k. 

Lemma 11. Assume A1-A5 hold. In the optimal solution, H k C M. 

Proof. Let us consider a row i G H k . Let us write A(i, N) = P(i, k)(r + d 2 ) 
and try to estimate d 2 . By definition of H k , P(i, k) > x- We can obtain an 
upper bound on d 2 = A(i, N)/P(i, k) — r (N) as follows. Use the partition 
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of N given by (39). Then 

kmh 2 < n^ivoir/^ar+iirowii 2 

< (Vx 2 )l|i(iVi)ll 2 + l|i(^i)ll 2 



< (i + i/ x 2 )||ip 



2MI1/7-) M|2 
r2 



< 320(1 + l/ X 2 )nL 2 /(3q), 

using (17) for the first line, (36), (16) and P(i,k) > x f° r the second line, 
iVi C .Dunacc for the third, (22) for the fourth. 
Next, 

||d 2 (iV 2 )|| 2 = \\A^N 2 )/P(t,k) 2 -r (N 2 )\\ 2 

= \\A(i,N 2 )\\yp(i,k) 2 

= (l/P(i,kf)J2Mh3) 2 

jeN 2 

< (l/P(.,A;) 2 )^/|(P(.,top( J )) + ^) 2 

jeN 2 

jeN 2 

< ((t+o)/x) 2 \\n 2 

< 8((e + 9)/x) 2 nL 2 /(3q). 

For the second line we used the fact that r (iV2) = 0, which follows from (36) 
and (39). For the fourth line we used the fact that j is acceptable. For the 
fifth we used P(i, k) > x and P(i, top(j)) < e since % e H k C T k and j £ D k . 
For the last line we used (20). 
Finally, 

l|d 2 (iV3)|| 2 = £(A(i,j)/P(i,*)-l;) 2 

jeN 3 

< 

jeN 3 

< o 2 \\KD k )\\ 2 

< 86 2 n k L 2 /(3q). 

The second line follows because N 3 C D acc . The last line follows from (19). 
Thus, 



2 32<j)nL 2 8(e + 6) 2 nL 2 86nL 2 



l|d 2 ||<(l + l/x 2 )^^ + ^i + 

3q x 2 ■ 3g 3q 



27 



Now apply (13) to the first term (plus the fact (1 + 1/x 2 ) < 2/x 2 ), (11) to 
the second and (10) to the third term to conclude that 



" 6 • 278 • qmi 

Comparing to (38), ||d 2 || < ||r ||/6. Thus, A(i,N) is a perturbation of r of 
relative size at most 1/6. By (35), the right singular vector of A(M,N) is 
also a perturbation of r of relative size at most 1/6. By Lemma 9, inserting 
% into M can only increase the objective function value. □ 

Now finally we can prove Theorem 2. 

Proof. Consider an entry € (T k x D k ) A (M x N). We take two cases: 
the first case is € (T k x D k ) — (M x N). In this case, since H k C M and 
D' k C N as proved in the two preceding lemmas, it must be the case that 
either j e D k — D' k or i e T k — H k , i.e., either the entry is in an unacceptable 
column or it is a acceptable but not heavy. 

The second case is € (M x N) — (T k x D k ). Thus, either j is on 
a topic other than k (i.e., j (jL D k ), or it is on topic k but is not a heavy 
entry (i.e., j G D k but i T k , so i ^ H k ). Thus, either j is an unacceptable 
column, or else % is not a heavy entry, because if by Lemma 7, A(M, N) 
cannot contain any acceptable and heavy entries except on topic k. 

Thus, we see that all entries indexed by (7^ x D k ) A (M x N) are either 
unacceptable or not heavy. The maximum norm of unacceptable entries is 
given by 

\\A(:,D unacc )\\ 2 F < 320nL 2 /(3g) 

< anL 2 /(512qtm), 

where the first line comes from (23) and the second from (14). 

The maximum norm of entries that are acceptable but not heavy is 

Mhjf < S2m X 2 nL 2 /(3q) 

A(i,j) not heavy & j acceptable 

< anL 2 /(512qtm), 

where the first line comes from (30) and the second from (8). Thus, adding 
the two previous inequalities shows that the sum of entries indexed by the 
symmetric difference (T k x D k ) A (M x N) is at most anL 2 / (256qtm) . This 
is a fraction of at most a times the optimal value as shown by (28). □ 
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6 Behavior of the objective function on de- 
composable bitmap images 

We consider the behavior of objective function (1) on decomposable bitmap 
images. A bitmap image is one in which each pixel is either white (0) or black 
(1). Suppose that A is an m x n matrix encoding a family of images; here m 
is the number of pixels per image and n is the number of images. Since the 
images are assumed to be bitmaps, every entry of A is either or 1. 

A collection of images is decomposable if there exists a partitioning of the 
pixel positions {1, . . . , m} into t subsets Ti, . . . , T t , called features, such that 
for every k, every image is either black in all of T*. or is white in all of T^. 
Clearly any collection of images is decomposable into individual pixels (i.e., 
Ti = {1}, T2 = {2}, etc.), so the interesting case is when t <C m. Donoho 
and Stodden [7] have considered a particular kind of decomposable bitmap 
image database. 

We now consider a simple probabilistic model of generating a database 
of n decomposable bitmap images (i.e., an m x n matrix) and prove that the 
objective function (1) is able to identify a feature in the database with high 
probability. There are many other ways to define a model for which a similar 
theorem could be proved. 

Theorem 12. Let Ti, . . . , T t , the features, be a partition of {1, ... , m} with 
t > 1. Let m m i n ,m max denote minjt =1 .. ^ \Tk\, maxfc =1 ... t |T^| respectively. 
Let I be an integer in 1, . . . ,t/2. Assume that each of the n images in the 
matrix A is generated independently by selecting exactly I features uniformly 
at random out of the possible t. Finally, assume that 7 > 4m/m m j„. Then 
with probability tending to 1 as n — > oo ; the optimizer of (1) applied to this 
A will select M = Tk for some k such that \Tk\ — rn m!lx . 

Proof. Let (M,N) be the optimizing solution of (1). Observe that, for any 
k, all the rows of A indexed by Tk are identical by construction. Therefore, 
it follow from (3) that if any row from Tk lies in M , then all of T& must be 
included in M since all have the same g« value. Thus, M is a union of some 
of the Tfc's. 

We claim that it is impossible that the optimal N contains two columns 
j and j' such that bitmap j contains feature Tk for Tk C M while f does not 
contain T*.. The reason is that in this case, A(M,j) consists of a vector with 
l's in positions indexed by T k and while A(M,j') has 0's in these positions. 
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Let mi be the number of '1' pixels in j and m 2 be the number of '1' pixels 
in j'. Observe m\ < m and m 2 < m. Let q = \T^\ so that q > m min . 
Then the number of '1' pixels in common between images j and j' is at most 
min(mi — q, 7772). Consider the left-hand side of (32): 

A(M,j) T A(M,f) 
\\A(M,j)\\.\\A(M,j>)\\ ~ 



< 



< 



< 
< 

By assumption, 7 > 4m/m m i n , so the right-hand side of (40) is less than 
1 — 2/7. Thus, by Lemma 6, not both j and j' can be in the optimal choice 
of N. 

Thus, we conclude that all columns taking part in the optimal solution 
must have all l's (or all 0's) in positions indexed by M. Ignore the columns 
of all 0's since their presence does not affect the objective function value. 
Consider now a feature k such that |T^| = m max . Feature k is expected to 
occur in the fraction l/t of columns of A. For any e > 0, by choosing n 
sufficiently large, we can assume with probability arbitrarily close to 1 that 
this choice occurs in the fraction at least l/t — e of the columns. Therefore, 

f(T k ,N)>n(l/t-e) (41) 

for any e > and n sufficiently large, where N is the set of columns containing 
feature k. 

Now consider any other possible choice of M; suppose e.g., that M has 
s of the features. By the preceding argument, the optimal choice of iV that 
could accompany this M contains only columns that use all s features. This 
union of s features is expected to occur in the fraction 

1(1- 1) •••(/- s + 1) 
t(t-l)---(t- s + 1) 



min(mi — q, m 2 ) 



9 , if mi — q < 777.2, 
; 1712 , if 7771 — q > m 2 

/ m )~ q = , if 777i - g < 777 2 , 
y'm 1 (m 1 -q) 

^= i : if 777i - g > 777 2 

y/l - q/m 1 

a/1 - 777 min /777 

l- 777 min /(2777). (40) 
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of the columns. Thus, for any e > 0, for n sufficiently large, 

f(M,m, n <%XU^ c -- (42) 

(The factor sm max is the maximum contribution to ||A(M, iV) || from a par- 
ticular column of N.) Now it is a simple matter to check that for any positive 
integers I, t such that I < t/2 and s < I, 

f(f-l)...(f_ 8 + l)) i 
t(t-l)---(t- s + 1) ~ t 

with strict inequality for s > 1. Thus, by comparing (41) with (42), we 
conclude that as n — > oo, with probability tending to 1, f(Tk,N) is the 
optimal value of the objective function. □ 

It should be noted that the previous theorem states that the optimal M 
includes a single feature k but says nothing about the optimal N. Indeed, 
as noted in the proof, we can take the optimal TV to be {1, . . . , n}. In some 
situations it might be desirable for the optimal N to include only those 
images that use feature k. This can be achieved by including a penalty term 
(6) into the objective function in which p is chosen to be a very small positive 
coefficient. 

7 On the complexity of maximizing /(M, N) 

In this section, we observe that the problem of globally maximizing (2) is 
NP-hard at least in the case that 7 is treated as an input parameter. This 
observation explains why RID settles for a heuristic maximization of (2) 
rather than exact maximization. First, observe that the maximum biclique 
(MBC) problem is NP-hard as proved by Peeters [16]. We show that the 
MBC problem can be transformed to an instance of (2). 

Let us recall the definition of the MBC problem. The input is a bipartite 
graph G. The problem is to find an (m, n)-complete bipartite subgraph 
K (sometimes called a biclique) of G such that van is maximized, i.e., the 
number of edges of K is maximized. 

Suppose we are given G, an instance of the maximum biclique problem. 
Let A be the left-right adjacency matrix of G, that is, if G — (U, V, E) where 
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UUV is the bipartition of the node set, then A has \U\ rows and \V\ columns, 
and A(i,j) = 1 if G E for i G U and j G V, else A(i,j) = 0. 

Consider maximizing (2) for this choice of A. We require the following 
preliminary linear-algebraic lemma. 

Lemma 13. Let A be a matrix that has either of the following as a submatrix: 



Then a 2 (A) > 0.618. 

Proof. First, observe that if U is a submatrix of A, then \\U\\2 < ||^4||2- This 
follows directly from the operator definition of the matrix 2-norm. 

Next, recall the following preliminary well known fact (Theorem 2.5.3 of 
[9]): For any matrix A, a^A) = mm{\\A-B\\ 2 : rank(5) =i-l}. This fact 
implies the following generalization of the result in the previous paragraph: if 
U is a submatrix of A, say U = A(M, N), then for any i, o~i(U) < o~i(A). The 
reason is that if B G Rl M l x l Jv l is a rank-/c matrix, then B G R mxri defined by 
padding with zeros is also rank k, and \\A — B\\ 2 > \\U — B\\ 2 by the result in 
the previous paragraph. Now finally the lemma is proved, since a 2 (Ui) = 1 
and a 2 (U 2 ) > 0.618. □ 

This lemma leads to the following lemma. 

Lemma 14. Suppose all entries of A G R mxri are either or 1, and suppose 
and at least one entry is 1. Suppose M, N are the optimal solution for maxi- 
mizing f(M,N) given by (2). Suppose also that the parameter 7 is chosen to 
be 2.7mn + 1 or larger. Then the optimal choice of M, N must yield a matrix 
A(M, N) of all 1 's, possibly augmented with some rows or columns that are 
entirely zeros. 

Proof. First, note that the optimal objective function value is at least 1 
since we could take M = {i} and iV = {j} where are chosen so that 
A(i,j) = 1. In this case, f(M,N) = 1. 

Let (M,N) be a pair of index sets that is a putative optimum for (2). 
Suppose A(i,j) = 0, where (i,j) G M x N. One possibility is that either row 
% or column j is entirely made of 0's, in which case (M, N) conforms to the 
claim in the lemma. The other case is that A(i,j) = and yet there is an 
i! G M and f G N such that A(i',j) = A(i,f) = 1. In this case, A(M,N) 




(43) 
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has as a submatrix (using rows and columns {j',j}) one of the two 

special matrices U\ or XJ 2 from (43). Thus, <J2{A(M, N)) > 0.618. On the 
other hand, a 1 {A{M,N)f < \\A(M,N)\\ 2 F < mn. Therefore, f(M, N) < 
mn — (7 — 1)(0.618 2 ) < since (7 — 1) (0.618) > mn by choice of 7. In 
particular, this means (M, N) cannot be optimal. □ 

If A(M, N) includes a row or column entirely of zeros, then this row or 
column may be dropped without affecting the value of the objective func- 
tion (2). Hence it follows from the lemma that without loss of generality 
that the optimizer (M,N) of (2) indexes a matrix of all l's. In that case, 
a 1 (A(M : N)) = y/\M\- \N\ while a 2 (A(M,N)) = ■■■ = a p (A(M,N)) = 
(where p = min(|M|, \N\)), and hence f (M, N) = \M\ ■ \N\. Thus, the value 
of the objective function corresponds exactly to the number of edges in the 
biclique. This completes the proof that biclique is reducible in polynomial 
time to maximizing (2). 

We note that Gillis [8] also uses the result of Peeters for a similar purpose, 
namely, to show that the subproblem arising in his NMF algorithm is also 
NP-hard. 

The NP-hardness result in this section requires that 7 be an input pa- 
rameter. We conjecture that (2) is NP-hard even when 7 is fixed (say 7 = 4 
as used herein). 

8 Image database test cases 

9 Text database test cases 

10 Conclusions 

We have proposed an algorithm called RID for nonnegative matrix factor- 
ization. It is based on greedy rank-one downdating according to an objective 
function, which is heuristically maximized. We have shown that the objective 
function is well suited for identifying topics in the e-separable text model and 
on a model of decomposable bitmap images. Finally, we have shown that the 
algorithm performs well in practice. 

This work raises several interesting open questions. First, the e-separable 
text model seems rather too simple to describe real text, so it would be 
interesting to see if the results generalize to more realistic models. 
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One straightforward generalization is to consider power-law models for 
document lengths that generalize the Zipf law: suppose that the probability 
of length I occurring is proportional to l~ p for some p. Our proof of Theorem 2 
generalizes to cover the case < p < 1 without too much difficulty. However, 
proving the theorem in the case p > 1 appears to be much more difficult (and 
perhaps the theorem is not true in this case). When p > 1, short documents 
dominate the corpus, and short documents are not easily analyzed using 
Cher noff-Hoeff ding bounds. 

A second question is to generalize the model of image databases for which 
a theorem can be established. 

A third question asks whether a result like Theorem 2 will hold for the 
RID algorithm using our proposed definition of the heuristic subroutine 
ApproxRankOneSubmatrix. When ApproxRankOneSubmatrix. is applied to 
an e-separable corpus, does it successfully identify a topic? Here is an ex- 
ample of a difficulty. Suppose n — > oo much faster than L. In this case, the 
document j with the highest norm will be the one in which lj is very close to 
L and in which one entry A(i, j) is very close to L while the rest are mostly 
zeros. This is because the maximizer of ||x|| 2 subject to the constraint that 
||x||i = C occurs when one entry of x is equal to C and the rest are zero. It 
is likely that at least one instance of a such a document will occur regardless 
of the matrix P(-, •) if n is sufficiently large. This document will then act as 
the seed for expanding M and N, but it may not be similar to any topic. 

The scenario described in the preceding paragraph can apparently be 
prevented by requiring n and L to grow proportionately, but the analysis 
appears to be complicated in this case. If we assume that the initial column 
j selected by RID is well approximated by ljP(:,k) for some k (i.e., the 
column is 'acceptable' in the terminology of Theorem 2), then the rest of M 
and N is likely to be the topics and terms associated with topic k. This is 
because Lemma 7 indicates that it is unlikely that a column or row associated 
to another topic can improve the objective function (1), whereas Lemmas 10 
and 11 indicate that a document or term associated with topic k will be 
favored by (1). 
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